UKFin+: Can We Trust AI to Automate Regulatory Compliance?

Rigorous Evaluation for Responsible Deployment

Professor Barry Quinn
Ulster University Business School
Mr Abhishek Pramanick
Queen’s University Belfast
Dr Fearghal Kearney
Queen’s University Belfast
Professor Jesus Martinez Del Rincon
Queen’s University Belfast

2025-11-01

The Promise vs The Reality

The Burden

  • Financial services: £780 billion annually on compliance
  • Manual review: £200-500 per submission
  • FCA supervises: 50,000+ firms
  • Timeline: Weeks to months for complex rules

The AI Promise

  • Automated compliance checking
  • Instant regulatory interpretation
  • 50-70% cost reduction projected
  • RegTech market: $7.9B → $45.8B by 2032

But Should We Trust It?

The Reality Check

  • Can AI accurately extract regulatory requirements?
  • Does AI know when it’s wrong?
  • Which regulations can we safely automate?

The Research Question

Can AI Safely Automate Regulatory Compliance in Finance?

The AI and Law research community increasingly emphasises rigorous evaluation of real systems on actual legal text.

Our Investigation:

  1. Accuracy: How well do different AI architectures extract structured rules?
    • Symbolic knowledge graphs vs neural language models vs hybrids
  2. Reliability: Can AI systems tell when they’re wrong?
    • Do confidence scores correlate with actual correctness?
  3. Deployment: Which regulations can we safely automate?
    • Where does human professional judgement remain essential?

Our Test

  • 6 AI architectures (symbolic, neural, hybrid)
  • Canonical evaluation set: 36 European Fund Classification (EFC) rules (held-out subset of the EFC corpus, used only for testing, not for building the knowledge base)
  • Shared EFC knowledge base: ~190 manually encoded EFC rules in GraphDB, used to build the symbolic KB and as the retrieval corpus for GraphRAG/Logic-LM (evaluation rules not included)
  • Systematic evaluation with reproducible methodology on real regulatory text (not simplified benchmarks)

Finding #1 – Structural Parity Surprise

Traditional Knowledge Graphs vs Modern AI: A Surprising Tie

Recent computational legal studies distinguish between “law-as-code” (formal symbolic representations) and “law-as-data” (statistical pattern learning).

The Architectural Landscape

The Surprise: GraphDB remains structurally strongest (≈88% GED similarity), but modern AI architectures are closer than many would expect (≈72–74%), so there is no simple ‘AI wins’ or ‘AI fails’ story.

But HOW They Achieve It Differs

GraphDB (Symbolic)

Strengths:

  • ✅ Perfect terminology (94% lexical consistency)
  • ✅ Explicit constraint representation
  • ✅ Transparent reasoning

Limitations:

  • ❌ ~190 manually encoded rules
  • ❌ 10 months expert effort
  • ❌ Expensive to extend

LLMs (Neural)

Strengths:

  • ✅ Zero manual encoding
  • ✅ 2 days implementation
  • ✅ Flexible coverage

Limitations:

  • ❌ Severe lexical drift (2-4% consistency—uses different terminology)
  • ❌ Requires canonicalization (post-processing to align terms)
  • ❌ Black-box reasoning
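Canonicalization here means mapping the model's free-form vocabulary back onto the knowledge base's terms. A minimal sketch; the term map below is invented for illustration, not taken from the EFC knowledge base:

```python
# Illustrative canonicalization pass. The CANONICAL mappings are invented
# placeholders standing in for the KB's real controlled vocabulary.
CANONICAL = {
    "equities": "equity_securities",
    "equity securities": "equity_securities",
    "shares": "equity_securities",
    "fixed income": "bonds",
    "debt securities": "bonds",
}

def canonicalize(term: str) -> str:
    """Map an LLM-produced term onto the KB's canonical vocabulary."""
    key = term.strip().lower()
    return CANONICAL.get(key, key)  # unknown terms pass through unchanged

print(canonicalize("Equity Securities"))  # equity_securities
print(canonicalize("Fixed Income"))       # bonds
```

Lexical drift makes a pass like this mandatory for neural outputs, whereas symbolic extraction emits canonical terms directly.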

The Trade-Off

This isn’t about one architecture “winning”—it’s about fundamentally different trade-offs between knowledge engineering costs and output processing complexity. Your choice depends on operational context, not raw accuracy.

Finding #2 – The Confidence Trap

When AI Is Most Confident, It’s Often Most Wrong

Parallel from Medical AI

Recent research at Johns Hopkins shows doctors face similar challenges calibrating reliance on AI recommendations—but this is the first systematic documentation of inverse calibration in regulatory compliance automation.

Research on AI in high-stakes medical decisions has documented “overconfidence” problems. But we discovered something more concerning: systematic inverse calibration.

Expected vs Reality

Expected (Good Calibration):

High Confidence → High Accuracy ✅

When the model says “90% confident,” accuracy should be ~90%

Reality (Inverse Calibration):

High Confidence → Low Accuracy ❌

When the model says “90% confident,” accuracy is often at its lowest

The Discovery: Negative Confidence-Accuracy Correlation

Model Architecture       Calibration (r)*   Interpretation
Few-Shot LLM             -0.545             🚨 INVERSE
GraphRAG (k=10)          -0.469             🚨 INVERSE
GraphRAG (k=50)          -0.500             🚨 INVERSE
Constrained Generation   -0.352             🚨 INVERSE
Logic-LM                 +0.135             ✅ Positive (weak)
Chain-of-Thought         +0.190             ✅ Positive (weak)

*Correlation coefficient: +1 = perfect calibration, 0 = no relationship, -1 = inverse calibration

The Deployment Disaster

Naive confidence-based routing would:

  • Automate the WORST predictions
    • “Model is confident → must be right!”
  • Flag CORRECT predictions for review
    • “Model uncertain → better check!”
  • Result: Systematically route errors to automation

Example

Rule: “Fund must allocate ≥80% to bonds, ≥30% to equities…”

Model Output:

  • Confidence: 95% 😊
  • Accuracy: WRONG (80%+30%>100%)
  • Naive routing: AUTOMATE THIS! ☠️
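Infeasible minima like this can be caught by a deterministic check that ignores model confidence entirely. A hypothetical validator sketch (not the paper's implementation):

```python
def minimums_feasible(min_allocations: dict[str, float]) -> bool:
    """Check that minimum-allocation constraints can hold simultaneously:
    the stated minima cannot sum to more than 100% of fund assets."""
    return sum(min_allocations.values()) <= 100.0

# The extracted rule from the example: ≥80% bonds AND ≥30% equities
extracted = {"bonds": 80.0, "equities": 30.0}
print(minimums_feasible(extracted))  # False: 80 + 30 > 100
```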

Key Insight

This is the first systematic characterization of inverse calibration specifically in regulatory text extraction tasks. It has profound implications for safe deployment.

Finding #3 – The Ambiguity Reality

Ambiguity Isn’t the Exception—It’s How Regulation Works

Legal scholars have long recognised that ambiguity serves essential functions: contextual application, evolutionary interpretation, professional judgement.

Analysis of 34 FCA Regulations:


  • 60% Ambiguous Terms
  • 40% Purely Quantitative

The 40% We Can Automate:

  • “≥80% of fund assets must be equities”
  • “Net asset value calculated daily”
  • Clear numerical thresholds

The 60% That Resist Automation:

  • “Reasonable steps to verify…”
  • “Appropriate controls for risk…”
  • “Material changes disclosed…”
  • “Predominantly invested in…”

This Isn’t Bad Drafting

This Is Essential Design

These ambiguous terms serve critical regulatory functions:

  • Flexibility across diverse business models
  • Adaptability as markets evolve
  • Professional judgement for contextually appropriate application

Recent computational law research argues that attempting to eliminate ambiguity through rigid formalization would either produce unworkable specificity or drain rules of practical meaning.

Regulation requires some ambiguity.

The Solution – Graduated Automation

Route by Rule Ambiguity, Not Model Confidence

Given inverse calibration (confidence scores mislead) and pervasive ambiguity (60% of rules resist full automation), we developed an ambiguity-aware routing framework.

Why Ambiguity Matters

Legal Ambiguity Serves Essential Functions

Legal scholars recognise ambiguity isn’t bad drafting—it’s essential design:

  • Permits contextual application across diverse business models
  • Allows evolutionary interpretation as markets change
  • Preserves professional judgement for contextually appropriate application

We don’t try to eliminate ambiguity—we design around it.

Three-Tier System

Tier 1 – Automatic

Example: “Fund must hold ≥80% equity securities”

  • Characteristics: Purely quantitative thresholds, clear definitions
  • Processing: AI extraction → automated validation → no human review
  • Coverage: ~40% of FCA rules
  • Risk: Low (clear requirements, verifiable outcomes)

Example: “Fund predominantly invested in equity securities”

  • Characteristics: Partial specification, requires interpretation
  • Processing: AI extraction → human verification → expert approval
  • Coverage: ~20% of FCA rules
  • Risk: Medium (AI suggests, human validates)

Tier 3 – Human

Example: “Fund employs appropriate controls for liquidity risk”

  • Characteristics: Professional standards, firm-specific context
  • Processing: Human interpretation → expert encoding → peer review
  • Coverage: ~40% of FCA rules
  • Risk: High (requires professional judgement)

Why This Works

❌ Traditional (Fails):

if model.confidence > 0.8:
    automate()   # ⚠️ routes the worst predictions!
else:
    human_review()

✅ Our Approach (Robust):

ambiguity = measure_text(text)

if ambiguity < 0.3:
    automate()           # safe: quantitative rules
elif ambiguity < 0.6:
    hybrid_processing()  # AI suggests, human validates
else:
    human_review()       # preserve professional judgement

Key Advantage

Doesn’t rely on unreliable AI self-assessment. Routes based on measurable properties of regulatory text that we can verify independently.
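The three-tier routing above can be made runnable. In this sketch the vague-term lexicon and the scoring formula are invented placeholders; only the 0.3/0.6 thresholds come from the framework:

```python
# Illustrative ambiguity scorer. VAGUE_TERMS and the score formula are
# stand-ins; the 0.3 / 0.6 routing thresholds are the framework's.
VAGUE_TERMS = {"reasonable", "appropriate", "material", "predominantly"}

def ambiguity_score(rule_text: str) -> float:
    words = rule_text.lower().split()
    vague = sum(w.strip(".,") in VAGUE_TERMS for w in words)
    return min(1.0, vague / 2)  # crude proxy: 2+ vague terms → fully ambiguous

def route(rule_text: str) -> str:
    a = ambiguity_score(rule_text)
    if a < 0.3:
        return "automate"
    elif a < 0.6:
        return "hybrid"
    return "human_review"

print(route("Fund must hold >=80% equity securities"))      # automate
print(route("Fund predominantly invested in equities"))     # hybrid
print(route("Employ reasonable and appropriate controls"))  # human_review
```

Note the router never consults the model's confidence: the routing input is a property of the regulatory text itself.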

Impact & Deployment

Real-World Efficiency Gains With Safety Maintained

Projected Performance (1,000 regulatory submissions):

Tier        Rules   Processing          Cost/Sub   Human Effort
Automatic   40%     AI only             $1         0%
Hybrid      20%     AI + verification   $50        10-20 min
Human       40%     Expert review       $200       Full effort
CURRENT     100%    Manual              $300       100%
PROPOSED    100%    Graduated           $110       ~40%

Cost-Benefit & Safety

Per Institution (10,000 checks/year):

  • Current cost: ~$3M annually
  • Projected cost: ~$1.1M annually
  • Projected savings: ~$1.9M/year
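These per-institution figures follow directly from the tier table's per-submission costs:

```python
checks_per_year = 10_000
current_cost_per_check = 300   # fully manual, from the tier table
proposed_cost_per_check = 110  # graduated blend, from the tier table

current  = checks_per_year * current_cost_per_check
proposed = checks_per_year * proposed_cost_per_check
print(current, proposed, current - proposed)
# 3000000 1100000 1900000  → ~$3M vs ~$1.1M, ~$1.9M/year saved
```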

UK Financial Sector (1,000+ firms):

  • Sector-wide efficiency: hundreds of millions annually
  • (Caveat: Requires industry pilot validation)

Safety Properties:

✅ High-risk ambiguous rules get human review

✅ Quality maintained through graduated oversight

✅ Transparent routing rationale

✅ Robust to calibration failures

What Makes This Different

The Contributions

EMPIRICAL DISCOVERY

The Confidence Trap Documented

  • First quantitative evidence of inverse calibration in RegTech
  • 4 of 6 architectures: r < -0.35
  • Models systematically most confident when most wrong
  • Implication: Naive confidence-based routing would be dangerous

Medical AI shows similar problems, but they had not previously been documented for regulatory compliance

METHODOLOGICAL RIGOR

Fair Symbolic-Neural Comparison

  • Unified evaluation framework enabling direct comparison
  • Graph edit distance (structural similarity) after IRI normalization (standardizing identifiers)
  • Structural comparison finding: neural and hybrid architectures reach ~72–74% GED similarity, approaching the symbolic baseline (~88%)
  • Documents complementary trade-offs rather than dominance

Symbolic and neural architectures land closer than expected, with different strengths
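The normalization step can be sketched as follows. The evaluation uses graph edit distance after IRI normalization; this illustration substitutes a simpler edge-set Jaccard overlap to show why identifier normalization matters before any structural comparison:

```python
import re

def normalize_iri(iri: str) -> str:
    """Strip namespace and unify spacing/case (illustrative rules only)."""
    local = iri.rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    return re.sub(r"[\s_-]+", "_", local).lower()

def edge_similarity(edges_a, edges_b) -> float:
    """Jaccard overlap of normalized edge sets: a simple stand-in for
    the graph-edit-distance similarity used in the evaluation."""
    norm = lambda edges: {tuple(normalize_iri(x) for x in e) for e in edges}
    a, b = norm(edges_a), norm(edges_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Same rule, symbolic IRIs vs neural free-text identifiers (invented example)
symbolic = [("http://ex.org/Fund", "http://ex.org/holds",
             "http://ex.org/Equity_Securities")]
neural   = [("fund", "holds", "equity securities")]
print(edge_similarity(symbolic, neural))  # 1.0 after normalization
```

Without the normalization pass, the two graphs would score zero overlap purely on identifier style, which is exactly the artifact the fair-comparison framework is designed to remove.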

PRACTICAL FRAMEWORK

Graduated Automation That Works

  • Ambiguity-aware routing robust to calibration failures
  • Routes by measurable text ambiguity, not unreliable confidence
  • Operationalizes legal theory: ambiguity serves essential purposes
  • Empirically grounded thresholds (based on FCA rule analysis)

Route by ambiguity not confidence scores

Key Takeaways

Three Essential Points for Anyone Deploying AI in Compliance

The Three Takeaways

1. 🚨 AI Can’t Reliably Tell When It’s Wrong

  • 4 of 6 architectures inversely calibrated (r < -0.35)
  • Models most confident when most wrong
  • Don’t trust confidence scores for routing decisions

2. 📊 Ambiguity Is Fundamental, Not Fixable

  • 60% of FCA regulations contain ambiguous terms
  • Not regulatory failure—essential for flexibility
  • Design around ambiguity rather than assuming better AI will eliminate it

3. ✅ Graduated Automation Works Despite Limitations

  • Route by measurable text ambiguity, not unreliable confidence
  • ~60% projected cost reduction with safety maintained
  • First deployment framework addressing inverse calibration problem

Thank You + Questions

UKFin+: Rigorous Evaluation for Responsible Deployment

Key Achievement:

Systematic empirical assessment of AI for regulatory compliance that documents both capabilities and critical limitations, enabling graduated automation despite inverse calibration.

Contact:

Professor Barry Quinn CStat, PhD
b.quinn@ulster.ac.uk

Resources:

  • Full evaluation methodology
  • Graduated automation implementation
  • Calibration analysis tools
  • Ambiguity measurement framework

Next Steps:

  • Industry pilots with financial institutions
  • Extended evaluation across regulatory domains
  • Open-source release of evaluation framework
  • Academic publication in computational law venues

Discussion Topics:

  • Inverse calibration in your domain?
  • Ambiguity profile of your regulations?
  • Architectural trade-offs for your context?
  • Workflow integration challenges?

Questions?

Thank you for your attention